Monitoring Infrastructure

Overview

In this project, I worked on monitoring applications and infrastructure using AWS services. The ability to monitor applications and infrastructure is critical for delivering reliable, consistent IT services.

Monitoring requirements range from collecting statistics for long-term analysis to quickly reacting to changes and outages. Monitoring can also support compliance reporting by continuously checking that infrastructure is meeting organizational standards.

I learned how to use several AWS monitoring tools:

Amazon CloudWatch Metrics
Amazon CloudWatch Logs
Amazon CloudWatch Events
AWS Config

By the end, I successfully:

Used AWS Systems Manager Run Command to install the CloudWatch agent on Amazon Elastic Compute Cloud (Amazon EC2) instances
Monitored application logs using CloudWatch agent and CloudWatch Logs
Monitored system metrics using CloudWatch agent and CloudWatch Metrics
Created real-time notifications using CloudWatch Events
Tracked infrastructure compliance using AWS Config

Task 1: Installing the CloudWatch Agent

I started by installing the CloudWatch agent on an EC2 instance. The agent is versatile: it collects metrics and logs from both EC2 instances and on-premises servers, including:

System-level metrics from EC2 instances (CPU allocation, free disk space, and memory utilization). These metrics are collected from inside the machine itself and complement the standard metrics that CloudWatch collects from outside the instance.
System-level metrics from on-premises servers that enable the monitoring of hybrid environments and servers not managed by AWS.
System and application logs from both Linux and Windows servers.
Custom metrics from applications and services using the StatsD and collectd protocols.
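
To make the StatsD item concrete, here is a minimal sketch of what an application-side StatsD message looks like. It assumes the agent has a statsd section enabled in its configuration and listens on the conventional UDP port 8125; the configuration used later in this project does not enable it, and the metric names below are hypothetical.

```python
import socket


def format_statsd(name, value, metric_type="c"):
    """Build one StatsD datagram: <name>:<value>|<type> (c = counter, g = gauge, ms = timer)."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send a StatsD line over UDP (fire-and-forget; no reply is expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, used only for illustration.
    print(format_statsd("checkout.orders", 1))
    print(format_statsd("queue.depth", 42, "g"))
```

Because StatsD uses UDP, the application does not block or fail if the agent is not listening, which is why the protocol is popular for custom metrics.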

Here's how I did it:

First, I opened the AWS Management Console and selected Systems Manager from the Services menu.
In the navigation pane, I chose Run Command. I had to click the icon in the top-left corner to make the navigation pane visible.
I clicked "Run a Command" and selected the AWS-ConfigureAWSPackage option (which typically appears toward the top of the list).
Under Command parameters, I set:

Action: Install
Name: AmazonCloudWatchAgent
Version: latest

For Targets, I chose "Choose instances manually" and selected the Web Server instance.
I clicked Run and waited for the Overall status to change to Success, occasionally refreshing the page using the refresh button toward the top of the page.
To verify success, I viewed the output by clicking the expand icon next to the instance under Targets and outputs, then clicked View output.
I expanded Step 1 - Output and saw "Successfully installed arn:aws:ssm:::package/AmazonCloudWatchAgent."

I also noticed the message "Step execution skipped due to unsatisfied preconditions: '"StringEquals": [platformType, Windows]'. Step name: createDownloadFolder". That step applies only to Windows platforms, so it was expected on my Linux instance and I safely ignored it. When this message appears under Step 1 - Output, the installation result is shown under Step 2 - Output instead, because the instance was created from a Linux AMI.
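
The same Run Command request can be expressed programmatically. The sketch below only builds the request arguments for the Systems Manager SendCommand API from the values I entered in the console; the instance ID is a placeholder, and actually sending it would require boto3 and AWS credentials, which I have left as a comment.

```python
import json


def build_install_agent_request(instance_ids):
    """Return keyword arguments for an SSM SendCommand request that installs the agent."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "InstanceIds": list(instance_ids),
        # SSM document parameters are passed as lists of strings.
        "Parameters": {
            "action": ["Install"],
            "name": ["AmazonCloudWatchAgent"],
            "version": ["latest"],
        },
    }


if __name__ == "__main__":
    # Placeholder instance ID, not a value from this project.
    request = build_install_agent_request(["i-0123456789abcdef0"])
    print(json.dumps(request, indent=2))
    # With boto3 installed and credentials configured, this could be sent as:
    #   boto3.client("ssm").send_command(**request)
```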

Next, I needed to configure the CloudWatch agent to collect web server logs and system metrics. I stored this configuration in AWS Systems Manager Parameter Store:

In the navigation pane, I selected Parameter Store.
I clicked Create parameter and configured:

Name: Monitor-Web-Server
Description: Collect web logs and system metrics
Value: I pasted a JSON configuration that defined:

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "log_group_name": "HttpAccessLog",
            "file_path": "/var/log/httpd/access_log",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%b %d %H:%M:%S"
          },
          {
            "log_group_name": "HttpErrorLog",
            "file_path": "/var/log/httpd/error_log",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%b %d %H:%M:%S"
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "metrics_collection_interval": 10,
        "totalcpu": false
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "metrics_collection_interval": 10,
        "resources": [
          "*"
        ]
      },
      "diskio": {
        "measurement": [
          "io_time"
        ],
        "metrics_collection_interval": 10,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 10
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ],
        "metrics_collection_interval": 10
      }
    }
  }
}

I examined the configuration and found it defined the following items to be monitored:

Logs: Two web server log files to be collected and sent to CloudWatch Logs
Metrics: CPU, disk, disk I/O, memory, and swap metrics to be sent to CloudWatch Metrics, each collected every 10 seconds

I clicked Create parameter to store this parameter for reference when starting the CloudWatch agent.
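
A quick way to double-check what a configuration like this collects is to parse it and list the log files and metric sections. The sketch below uses an abbreviated copy of the Monitor-Web-Server value (same structure, with the interval and resource options left out); the function name is my own.

```python
import json

# Abbreviated copy of the Monitor-Web-Server value (same structure, fewer options).
CONFIG_JSON = """
{
  "logs": {"logs_collected": {"files": {"collect_list": [
    {"log_group_name": "HttpAccessLog", "file_path": "/var/log/httpd/access_log"},
    {"log_group_name": "HttpErrorLog", "file_path": "/var/log/httpd/error_log"}
  ]}}},
  "metrics": {"metrics_collected": {
    "cpu": {"measurement": ["cpu_usage_idle", "cpu_usage_iowait",
                            "cpu_usage_user", "cpu_usage_system"]},
    "disk": {"measurement": ["used_percent", "inodes_free"]},
    "diskio": {"measurement": ["io_time"]},
    "mem": {"measurement": ["mem_used_percent"]},
    "swap": {"measurement": ["swap_used_percent"]}
  }}
}
"""


def summarize_agent_config(config_text):
    """Return ({log group: file path}, {metric section: measurements}) for an agent config."""
    config = json.loads(config_text)
    files = config["logs"]["logs_collected"]["files"]["collect_list"]
    log_files = {entry["log_group_name"]: entry["file_path"] for entry in files}
    collected = config["metrics"]["metrics_collected"]
    metrics = {section: settings["measurement"] for section, settings in collected.items()}
    return log_files, metrics


if __name__ == "__main__":
    logs, metrics = summarize_agent_config(CONFIG_JSON)
    for group, path in logs.items():
        print(f"log group {group} <- {path}")
    for section, names in metrics.items():
        print(f"metric section {section}: {', '.join(names)}")
```

Parsing the value before storing it also catches JSON syntax errors early, before the agent tries to load the parameter.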

After creating the parameter, I started the CloudWatch agent on the web server:

I went back to Run Command and clicked "Run command".
I filtered for "AmazonCloudWatch-ManageAgent" by:

Selecting Document name prefix
Selecting Equals
Entering AmazonCloudWatch-ManageAgent
Verifying the filter was Document name prefix : Equals : AmazonCloudWatch-ManageAgent
Pressing Enter

Before running it, I viewed the command definition by clicking on AmazonCloudWatch-ManageAgent (the name itself).
A new browser tab opened showing the definition of the command. I browsed through the content of each tab to see how a command document is defined.
I checked the Content tab and scrolled to the bottom to see the actual script that would run on the target instance. The script referenced the AWS Systems Manager Parameter Store to retrieve the CloudWatch agent configuration that I defined earlier.
After closing that tab, I selected AmazonCloudWatch-ManageAgent and configured:

Action: configure
Mode: ec2
Optional Configuration Source: ssm
Optional Configuration Location: Monitor-Web-Server
Optional Restart: yes

For Targets, I chose "Choose instances manually" and selected the Web Server.
I clicked Run and waited for Success status, occasionally refreshing the page.

At this point, the CloudWatch agent was running and sending log and metric data to CloudWatch.
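
For context, the command document ultimately drives the agent's control script on the instance. The sketch below shows how the form values map onto that script's arguments, under my assumptions that the "configure" action corresponds to the script's fetch-config action, that "Optional Restart: yes" corresponds to the -s flag, and that the agent is installed at its default Linux path. It only builds and prints the command; it does not run it.

```python
import shlex

# Default install location of the control script on Linux (assumed here).
AGENT_CTL = "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl"


def build_agent_ctl_command(mode, source, location, restart):
    """Build the argument list that loads a configuration into the CloudWatch agent."""
    args = [AGENT_CTL, "-a", "fetch-config", "-m", mode, "-c", f"{source}:{location}"]
    if restart:
        # -s starts (or restarts) the agent after the configuration is fetched.
        args.append("-s")
    return args


if __name__ == "__main__":
    command = build_agent_ctl_command("ec2", "ssm", "Monitor-Web-Server", restart=True)
    print(shlex.join(command))
```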

Task 2: Monitoring Application Logs Using CloudWatch Logs

CloudWatch Logs lets me monitor applications and systems using log data. For example, CloudWatch Logs can track the number of errors that occur in application logs and send a notification whenever the rate of errors exceeds a threshold that I specify.

CloudWatch Logs uses existing log data for monitoring, so no code changes are required. For example, I can monitor application logs for specific literal terms (such as "NullReferenceException") or count the number of occurrences of a literal term at a particular position in log data (such as 404 status codes in a web server access log). When the term being searched for is found, CloudWatch Logs reports the data to a CloudWatch metric that I specify. Log data is encrypted while in transit and while it is at rest.

The Web Server generates two types of log data:

Access logs
Error logs

I generated some log data on the Web Server to monitor with CloudWatch Logs:

I clicked the Details dropdown menu above the instructions and clicked Show.
I copied the WebServerIP value and opened it in a new browser tab, which showed a web server Test Page.
To generate log data, I appended "/start" to the URL, which generated a 404 error since the page doesn't exist. This was intentional to generate data in the access logs.
I kept this tab open but returned to the AWS Management Console.
From the Services menu, I opened CloudWatch.
In the navigation pane, I chose Log groups and saw two log groups: HttpAccessLog and HttpErrorLog. (If they weren't listed, I waited a minute and clicked Refresh.)
I clicked on HttpAccessLog (the name itself) and then selected the Log stream in the table (which had the same ID as the EC2 instance).
I saw log data consisting of GET requests, including information about the computer and browser that made the request. I expanded lines using the arrow to view additional information.
I found a line with my /start request with a code of 404, indicating the page was not found.

This demonstrated how log files can be automatically shipped from an EC2 instance or an on-premises server to CloudWatch Logs, making log data accessible without having to log in to each individual server. Log data can also be collected from multiple servers, such as an Auto Scaling fleet of web servers.

Creating a Metric Filter in CloudWatch Logs

I configured a filter to identify 404 errors in the access log. A steady stream of 404 errors normally indicates that the website contains invalid links that users are following:

In Log groups, I selected the checkbox next to HttpAccessLog.
From the Actions dropdown menu, I selected Create metric filter.
I entered this filter pattern: [ip, id, user, timestamp, request, status_code=404, size]
This tells CloudWatch Logs how to interpret the fields in the log data and defines a filter to find lines only with status_code=404.
In the Test pattern section, I used the dropdown menu to select the EC2 instance ID (similar to i-0f07ab62aae4xxxx9).
I clicked Test pattern and then Show test results.
I confirmed I could see at least one result with a $status_code of 404.
I clicked Next and set:

Filter name: 404Errors
Metric namespace: LogMetrics
Metric name: 404Errors
Metric value: 1

I clicked Next (clicking an empty text field first if Next wasn't enabled).
On the Review and create page, I clicked Create metric filter.

This metric filter could now be used in an alarm.
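
To illustrate how a space-delimited filter pattern like the one above is applied, here is a simplified sketch. It splits a log line into fields (treating text in square brackets or double quotes as a single field), names the fields in order, and checks status_code. The sample lines are made up and use a seven-field layout; this is an approximation of the matching logic, not CloudWatch's actual implementation.

```python
import re

# Field names from the filter pattern, in order.
FIELD_NAMES = ["ip", "id", "user", "timestamp", "request", "status_code", "size"]

# A field is a [bracketed] span, a "quoted" span, or a run of non-space characters.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def parse_access_line(line):
    """Map a log line onto the named fields, or return None if the field count differs."""
    tokens = TOKEN.findall(line)
    if len(tokens) != len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, tokens))


def matches_404(line):
    """True when the line fits the field layout and its status_code field is 404."""
    fields = parse_access_line(line)
    return fields is not None and fields["status_code"] == "404"


def metric_value(lines):
    """Each matching line contributes the configured metric value of 1."""
    return sum(1 for line in lines if matches_404(line))


if __name__ == "__main__":
    # Made-up sample lines for illustration.
    sample = [
        '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /start HTTP/1.1" 404 196',
        '192.0.2.10 - - [10/Oct/2023:13:55:40 +0000] "GET / HTTP/1.1" 200 3630',
    ]
    print(metric_value(sample))
```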

Creating an Alarm Using the Filter

I configured an alarm to notify me when too many 404 errors occur:

In the 404Errors panel, I selected the checkbox in the top-right corner.
In the Metric filters section, I clicked Create alarm.
I configured:

Period: 1 minute
Conditions: Greater/Equal (>=) than 5

I clicked Next and created a new SNS topic with my email address.
I clicked Create topic, then Next.
I set:

Alarm name: 404 Errors
Alarm description: Alert when too many 404s detected on an instance

I clicked Next and Create alarm.
I confirmed the subscription by clicking the link in the confirmation email.
Back in CloudWatch, the alarm appeared in orange, indicating "Insufficient data" because no data had been received in the past minute.

To test the alarm, I:

Returned to the web server browser tab. (If it was no longer open, I reopened it using the WebServerIP from the Details menu.)
Attempted to go to non-existent pages at least five times by adding different page names after the IP address (e.g., http://192.0.2.0/start2).
Waited 1-2 minutes for the alarm to trigger, refreshing occasionally.
Confirmed the graph turned red, indicating the Alarm state.
Checked my email and found an alarm notification with the subject "ALARM: 404 Errors".

This demonstrated how to create alarms from application log data and receive alerts for unusual behavior, with the log file accessible within CloudWatch Logs for further analysis.
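
The alarm behaviour I observed can be summarised in a few lines. This is a simplified sketch that assumes the alarm uses the Sum statistic over a single one-minute period and treats a period with no data points as insufficient data; real CloudWatch alarms have more options (evaluation periods, missing-data handling) that are not modelled here.

```python
def evaluate_alarm(match_times, period_start, period_seconds=60, threshold=5):
    """Return the alarm state for one period, given epoch seconds of 404 matches."""
    period_end = period_start + period_seconds
    in_period = [t for t in match_times if period_start <= t < period_end]
    if not in_period:
        # No data points were published during the period.
        return "INSUFFICIENT_DATA"
    # Each match publishes a metric value of 1, so the Sum equals the match count.
    total = len(in_period)
    return "ALARM" if total >= threshold else "OK"


if __name__ == "__main__":
    print(evaluate_alarm([], period_start=0))
    print(evaluate_alarm([5, 12, 30], period_start=0))
    print(evaluate_alarm([5, 12, 30, 41, 55], period_start=0))
```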

Task 3: Monitoring Instance Metrics Using CloudWatch

Metrics are data about the performance of systems. CloudWatch stores metrics for the AWS services used, and I can also publish my own application metrics either via the CloudWatch agent or directly from applications. CloudWatch can present the metrics for search, graphs, dashboards, and alarms.

I examined EC2 metrics:

From the Services menu, I chose EC2.
In the navigation pane, I selected Instances.
I selected the Web Server and examined the Monitoring tab in the lower half of the page.
I noted that CloudWatch captures metrics about CPU, disk, and network usage on the instance, viewing it from the outside as a virtual machine.

These metrics don't give insight into what is happening inside the instance, such as how much memory or disk space is free. The CloudWatch agent fills that gap: it runs inside the instance and collects these internal metrics.

To view the CloudWatch agent metrics:

From the Services menu, I selected CloudWatch.
In the navigation pane, I chose Metrics, then expanded Metrics and selected All metrics.
I saw various metrics in the lower half of the page, some automatically generated by AWS and others collected by the CloudWatch agent.
I chose CWAgent, then device, fstype, host, path to see disk space metrics.
I clicked CWAgent above the table (in the line showing All > CWAgent > device, fstype, host, path), then chose host to see metrics related to system memory.
I clicked All again and explored other metrics, selecting ones I wanted to appear on the graph.

Task 4: Creating Real-Time Notifications

CloudWatch Events delivers a near-real-time stream of system events describing changes in AWS resources. Simple rules can match events and route them to target functions or streams. CloudWatch Events becomes aware of operational changes as they occur.

CloudWatch Events can respond to these operational changes and take corrective action as necessary by sending messages, activating functions, making changes, and capturing state information. It can also schedule automated actions using cron or rate expressions.

I created a real-time notification for instance state changes:

In CloudWatch, I expanded Events in the navigation pane and chose Rules.
I clicked Create rule.
I configured the Event Source:

Service Name: EC2
Event Type: EC2 Instance State-change Notification
Selected the checkbox for Specific state(s)
From the dropdown menu, selected stopped and terminated

In the Targets section, I:

Clicked Add target
Selected SNS topic from the dropdown menu (instead of Lambda function)
For Topic, selected Default_CloudWatch_Alarms_Topic
I clicked Configure details, named the rule "Instance_Stopped_Terminated", and clicked Create rule.
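
Behind the console form, a rule like this is an event pattern. The sketch below shows the pattern that corresponds to the choices above and a deliberately small matcher covering only the features this pattern uses (a list of allowed values per field, and nested objects). The sample event is trimmed down and its instance ID is a placeholder.

```python
# Event pattern equivalent to the console selections for this rule.
PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Return True when every field in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: the event value must be one of the listed values.
            return False
    return True


if __name__ == "__main__":
    # Trimmed sample event with a placeholder instance ID.
    event = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
    }
    print(matches(PATTERN, event))
```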

Configuring a Real-Time Notification

I could configure Amazon Simple Notification Service (Amazon SNS) to send notifications to my phone via SMS or to my email. Since enabling SMS messaging requires opening a ticket with AWS Support and takes time, I used email instead.

I noted that more information about configuring SMS messaging with SNS is available in the Amazon Simple Notification Service Developer Guide.

From the Services menu, I chose Simple Notification Service.
In the navigation pane, I chose Topics.
I clicked the link in the Name column and saw a single subscription associated with my email address (the Topic I configured in Task 2).
From the Services menu, I chose EC2.
In the navigation pane, I chose Instances.
I selected the Web Server, clicked Instance state, then Stop instance, and then Stop.
The Web Server entered the Stopping state and then the Stopped state after a minute.
I received an email with details about the stopped instance in JSON format.

I noted that to receive a more readable message, I could create an AWS Lambda function triggered by CloudWatch Events. The Lambda function could format a more readable message and send it via Amazon SNS.
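
As a sketch of that idea, the function below turns a state-change event into a one-line message. I did not build this in the project; the handler shape follows the usual Python Lambda signature, the sample event is trimmed and uses placeholder values, and the SNS publish step is left as a comment because it needs boto3 and a topic ARN.

```python
def format_state_change(event):
    """Turn an EC2 state-change event into a short, human-readable sentence."""
    detail = event["detail"]
    return (
        f"EC2 instance {detail['instance-id']} is now {detail['state']} "
        f"(region {event['region']}, at {event['time']})."
    )


def lambda_handler(event, context):
    """Entry point a Lambda function could use; returns the formatted message."""
    message = format_state_change(event)
    # In a real function, the message could be published to the topic, e.g.:
    #   boto3.client("sns").publish(TopicArn=<topic ARN>, Message=message)
    return message


if __name__ == "__main__":
    # Trimmed sample event with placeholder values.
    sample_event = {
        "region": "us-east-1",
        "time": "2023-10-10T14:00:00Z",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
    }
    print(lambda_handler(sample_event, None))
```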

This demonstrated how to receive real-time notifications when infrastructure changes.

Task 5: Monitoring for Infrastructure Compliance

With AWS Config, I can assess, audit, and evaluate the configurations of AWS resources. AWS Config continuously monitors and records AWS resource configurations and allows automated evaluation of recorded configurations against desired configurations.

AWS Config lets me review changes in configurations and relationships between AWS resources, dive into detailed resource configuration histories, and determine overall compliance against configurations specified in internal guidelines. It simplifies compliance auditing, security analysis, change management, and operational troubleshooting.

I set up AWS Config rules:

From the Services menu, I chose Config.
If a Get started button appeared, I completed initial setup by clicking:

Get started
Next
Next
Confirm

This configured AWS Config for initial use, and I closed the Welcome window.
In the navigation pane, I chose Rules (toward the top).
I clicked Add rule and searched for "required-tags" in the AWS Managed Rules section.
I selected required-tags and clicked Next.
Under Parameters, I set tag1Key to "project" (replacing any existing value).
I clicked Next and Add rule.

This rule flags resources that do not have a project tag. The initial evaluation takes a few minutes to complete, so I continued with the next steps in the meantime.

I then added a rule to check for unused EBS volumes:

I clicked Add rule and searched for "ec2-volume-inuse-check".
I selected it, clicked Next twice, and Add rule.
I waited for at least one rule to complete evaluation, refreshing if needed.
If I saw "No resources in scope," I waited longer as AWS Config was still scanning resources.
I examined each rule by selecting "Compliant" under Resources in scope.

The results showed:

required-tags: A compliant EC2 instance (Web Server with project tag) and many non-compliant resources without project tags
ec2-volume-inuse-check: One compliant volume (attached to an instance) and one non-compliant volume (not attached)
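
The logic of these two checks is simple enough to sketch. The dictionaries below are a hypothetical, simplified representation of resources (not the actual AWS Config data format), and the functions only mirror what the managed rules evaluated in this project: the presence of a project tag, and whether a volume is attached to an instance.

```python
def required_tag_compliance(resource, tag_key="project"):
    """COMPLIANT when the resource carries the required tag key."""
    return "COMPLIANT" if tag_key in resource.get("tags", {}) else "NON_COMPLIANT"


def volume_in_use_compliance(volume):
    """COMPLIANT when the volume has at least one attachment."""
    return "COMPLIANT" if volume.get("attachments") else "NON_COMPLIANT"


if __name__ == "__main__":
    # Hypothetical resources for illustration only.
    web_server = {"id": "web-server", "tags": {"project": "monitoring"}}
    untagged = {"id": "other-resource", "tags": {}}
    attached = {"id": "vol-attached", "attachments": ["web-server"]}
    detached = {"id": "vol-detached", "attachments": []}

    print(required_tag_compliance(web_server), required_tag_compliance(untagged))
    print(volume_in_use_compliance(attached), volume_in_use_compliance(detached))
```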

I learned that AWS Config has a large library of pre-defined compliance checks, and custom checks can be created using Lambda.